Abstract. In this paper, we first consider the parameter estimation of a multivariate random process
distribution using a multivariate Gaussian mixture law. The labels of the mixture are allowed to
have a general probability law, which makes it possible to model a temporal structure of the
process under study. We generalize the case of the univariate Gaussian mixture in [1] to show that the
likelihood is unbounded and goes to infinity when one of the covariance matrices approaches the
boundary of singularity of the set of nonnegative definite matrices. We characterize the parameter
set of these singularities. As a solution to this degeneracy problem, we show that penalizing
the likelihood by an Inverse Wishart prior on the covariance matrices results in a penalized, or
maximum a posteriori, criterion which is bounded. The existence of positive definite matrices
optimizing this criterion can then be guaranteed. We also show that with a modified EM procedure or
with a Bayesian sampling scheme, we can constrain the covariance matrices to belong to a particular
subclass of covariance matrices. Finally, we study degeneracies in the source separation problem,
where the characterization of the parameter singularity set is more complex. We show, however, that
an Inverse Wishart prior on the covariance matrices eliminates the degeneracies in this case too.



INTRODUCTION 

We consider a double stochastic process:

• A discrete process $(z_t)_{t=1..T}$, with $z_t$ taking its values in the discrete set $Z = \{1..K\}$.

• A continuous process $(s_t)_{t=1..T}$ which is white conditionally on the first process
$(z_t)_{t=1..T}$ and follows a distribution:

$$p(s_t \mid z_t) = f(s_t; \theta_{z_t})$$

In the following, without loss of generality of the considered model, we restrict the
function $f(\cdot)$ to be a Gaussian: $f(\cdot \mid z) = \mathcal{N}(\mu_z, R_z)$.

This double process is called a "mixture model" in the literature. When the hidden process
$z_{1..T}$ is white, we have an i.i.d. mixture model, $p(s) = \sum_z p_z\,\mathcal{N}(\mu_z, R_z)$, and when
$z_{1..T}$ is Markovian, the model is called an HMM (Hidden Markov Model). For applications of
these two models see [2] and [3]. Mixture models are an interesting alternative to
nonparametric modeling. By increasing the number of mixture components, we are able
to approximate any probability density, and the time dependence structure of the hidden
process $z_{1..T}$ makes it possible to account for a correlation structure of the resulting process.
In the following, for clarity of the demonstrations, we assume an i.i.d. mixture model.



CHARACTERIZATION OF LIKELIHOOD DEGENERACY 



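We consider $T$ observations $(s_t)_{t=1..T}$ of a random $n$-vector following a multivariate
Gaussian mixture law:

$$p(s_t) = \sum_{z=1}^{K} p_z\,\mathcal{N}(s_t; \mu_z, R_z)$$

where $p_z = P(Z = z)$ is the probability that the random hidden label $Z$ takes the value
$z \in Z = \{1..K\}$, $\mu_z$ is the $n$-vector mean of the Gaussian component $z$ and $R_z$ its
$n \times n$ covariance matrix. We intend to estimate the parameters $\theta = (p_z, \mu_z, R_z)_{z \in 1..K}$
by maximizing the likelihood given the observations $s_{1..T} = [s_t]_{t=1..T}$:

$$\hat{\theta} = \arg\max_{\theta \in \Theta}\; p(s_{1..T} \mid \theta)$$

where

$$p(s_{1..T} \mid \theta) = \prod_{t=1}^{T} \sum_{z=1}^{K} p_z\,|2\pi R_z|^{-1/2} \exp\left[-\tfrac{1}{2}(s_t - \mu_z)^T R_z^{-1}(s_t - \mu_z)\right]$$

and

$$\Theta = \left\{\theta_z = (p_z, \mu_z, R_z) \;\middle|\; p_z \in \mathbb{R}_+,\ \sum_{z=1}^{K} p_z = 1;\ R_z \in \mathcal{R};\ \mu_z \in \mathbb{R}^n\right\} \qquad (1)$$

$\mathcal{R}$ is a closed subset of covariance matrices. Some examples of $\mathcal{R}$ are considered later
in section 4 and in [4].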

Proposition 1 [Likelihood function is unbounded]: $\forall\, s_{1..T} \in (\mathbb{R}^n)^T$, $\exists$ a singularity
point $\theta_s$ in the parameter space such that $\lim_{\theta \to \theta_s} p(s_{1..T} \mid \theta) = \infty$. These points are the
$\theta = (p_z, \mu_z, R_z)_{z \in Z}$ such that at least one of the $R_z$ (but not all of them together) is a
singular nonnegative definite matrix and the corresponding mean $\mu_z$ lies in the intersection of
$n - \mathrm{rank}(R_z)$ hyperplanes of $\mathbb{R}^n$.

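Proof: Let $z_0 \in Z$ and let $R_{z_0}$ be a singular NND matrix of rank $p < n$. $R_{z_0}$ can be
diagonalized in the orthogonal group:

$$R_{z_0} = U^T \Lambda U, \qquad \Lambda = \mathrm{diag}(0, \dots, 0, \lambda_{n-p+1}, \dots, \lambda_n)$$

Consider now a sequence of positive definite matrices $(R_{z_0}^{(m)})_{m \in \mathbb{N}}$ (indexed by $m$ to
avoid confusion with the dimension $n$) defined by:

$$R_{z_0}^{(m)} = U^T \Lambda^{(m)} U, \qquad \Lambda^{(m)} = \mathrm{diag}(\lambda_1^{(m)}, \dots, \lambda_{n-p}^{(m)}, \lambda_{n-p+1}, \dots, \lambda_n)$$

with the $(n - p)$ strictly positive numeric sequences $(\lambda_i^{(m)})_{i=1..(n-p)}$ which tend to 0.
Thus the sequence $(R_{z_0}^{(m)})$ converges to $R_{z_0}$. The likelihood function evaluated at
$R_{z_0}^{(m)}$ is:

$$p_m(s_{1..T} \mid \theta) = \prod_{t=1}^{T} \sum_{z} p_z\,|2\pi R_z|^{-1/2} \exp\left[-\tfrac{1}{2}(s_t - \mu_z)^T R_z^{-1}(s_t - \mu_z)\right], \qquad R_{z_0} = R_{z_0}^{(m)}$$

Expanding the exponent of the component $z_0$ in canonical form:

$$(s_t - \mu_{z_0})^T \big(R_{z_0}^{(m)}\big)^{-1} (s_t - \mu_{z_0}) = \sum_{i=1}^{n-p} \frac{[U(s_t - \mu_{z_0})]_i^2}{\lambda_i^{(m)}} + \sum_{i=n-p+1}^{n} \frac{[U(s_t - \mu_{z_0})]_i^2}{\lambda_i}$$

We can see that when the eigenvalues $(\lambda_i^{(m)})_{i=1..(n-p)}$ tend to zero or, equivalently, when
the covariance $R_{z_0}^{(m)}$ tends to $R_{z_0}$, and when $\mu_{z_0}$ lies in the intersection of the hyperplanes
$(H_i = \{\mu \mid [U(s_t - \mu)]_i = 0\})_{i=1..(n-p)}$, the likelihood function goes to infinity: the first sum
vanishes for that observation, so the exponent stays finite while the factor $|2\pi R_{z_0}^{(m)}|^{-1/2}$
diverges. So we have proved that any singular NND matrix is a point of degeneracy provided that the
means lie in specific hyperplanes. In the one-dimensional case, this corresponds to the fact
that $\sigma$ goes to zero and the corresponding mean coincides with one observation.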
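The degeneracy described in the proof can be checked numerically. The following is a minimal sketch, not the authors' experiment: the five 2-D samples, the two-component mixture and the helper name `gmm_loglik` are illustrative assumptions. One component is centred on the first sample and its covariance `diag(eps, 1)` is pushed towards a singular matrix, while the second component stays regular.

```python
import numpy as np

def gmm_loglik(samples, weights, means, covs):
    """Log-likelihood of i.i.d. samples under a Gaussian mixture (hypothetical helper)."""
    n = samples.shape[1]
    per_component = []
    for p, mu, cov in zip(weights, means, covs):
        diff = samples - mu
        _, logdet = np.linalg.slogdet(cov)
        maha = np.einsum("ti,ij,tj->t", diff, np.linalg.inv(cov), diff)
        per_component.append(np.log(p) - 0.5 * (n * np.log(2 * np.pi) + logdet + maha))
    log_terms = np.stack(per_component)            # shape (K, T)
    top = log_terms.max(axis=0)                    # log-sum-exp for numerical stability
    return float(np.sum(top + np.log(np.exp(log_terms - top).sum(axis=0))))

# Illustrative data (assumption): five fixed 2-D observations.
samples = np.array([[0.0, 0.0], [1.0, 0.5], [-1.0, 1.0], [2.0, -1.0], [-2.0, 0.0]])
weights = [0.5, 0.5]
# Component 0 sits exactly on the first observation; component 1 is regular.
means = [samples[0].copy(), samples.mean(axis=0)]

def loglik_for(eps):
    # One eigenvalue of the first covariance tends to zero with eps.
    covs = [np.diag([eps, 1.0]), np.eye(2)]
    return gmm_loglik(samples, weights, means, covs)

logliks = [loglik_for(eps) for eps in (1e-2, 1e-4, 1e-8, 1e-16)]
print(logliks)
```

With this setup the log-likelihood keeps growing as `eps` shrinks, because the regular component keeps every other observation at a finite density while the degenerate one diverges on the observation it is centred on.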

Figure 1 shows an example of this degeneracy. In this example, we take an original
distribution of a 2-D random vector which is a mixture of 10 Gaussians. The Gaussians
have their means located on a circle and share the same covariance. Figure 1-a shows the
graph of this distribution, from which we generated 100 samples in order to estimate its
parameters. Figure 1-b shows the estimated distribution. We can note the failure of the
maximum likelihood estimator and its tendency to converge to sharp Gaussians.

Here, we highlight the effect of increasing the dimension $n$, which increases the occurrence
of degeneracy. For $n > 1$, we have an infinite number of singularities. Moreover,
even if we fix the means of the mixture components, the unboundedness of the likelihood
might occur if some covariances go to particular singular matrices. We think, however, that
this second kind of degeneracy is less likely to happen, particularly if the number of
samples grows. We note that the occurrence of degeneracy increases when the dimension
grows and decreases when the number of samples grows.

Fig-1. Failure of the ML estimation of the parameters of a 10 component Gaussian
mixture distribution.



This degeneracy was noted by many authors (Day, 1969 [5]), and in (Hathaway 1986
[6]) a constrained formulation of the EM algorithm was proposed to eliminate
it. In (Ormoneit 1998 [7]), a penalization by an Inverse Wishart prior
was employed to eliminate it. Our contribution leads to the same penalization but in
a different manner. In (Ormoneit 1998), the Inverse Wishart prior was chosen because it is
a conjugate prior. In the one-dimensional case [1], the penalization by an Inverse Gamma
prior on variances was used to eliminate degeneracy.

In this work, after characterizing the origin of these singularities, we extend this
procedure to the multivariate case and propose an Inverse Wishart prior on the covariances
$R_z$ which guarantees the boundedness of the penalized likelihood:

$$p_{\alpha,\beta,J}(R_z) = K\,|R_z|^{-\beta} \exp\left[-\alpha\,\mathrm{Tr}\left(R_z^{-1} J\right)\right]$$

where $K$ is a normalization constant, $\alpha$ and $\beta$ are two strictly positive constants which
contain a priori information about the power level (scale parameter), and $J$ is a positive
definite symmetric matrix which contains a priori information on the covariance
structure. In fact, the mode of this law is given by:

$$\frac{\partial \log p_{\alpha,\beta,J}(R_z)}{\partial R_z} = -\beta R_z^{-1} + \alpha R_z^{-1} J R_z^{-1} = 0$$

Leading to:

$$R_z = \frac{\alpha}{\beta}\, J$$



BAYESIAN SOLUTION TO DEGENERACY




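Proposition 2: $\forall\, s_{1..T} \in (\mathbb{R}^n)^T$, the a posteriori distribution $p(\theta \mid s_{1..T})$ with the a
priori:

$$p(\theta) = \prod_{z \in Z} p_{\alpha,\beta,J}(R_z)$$

is bounded and goes to 0 when one of the covariance matrices $R_z$ approaches a singular
matrix.

Proof: The penalized likelihood is:

$$p(s_{1..T} \mid \theta)\,p(\theta) = \prod_{t=1}^{T} \left( \Big[\prod_{z \in Z} p(R_z)\Big]^{1/T} \sum_{z \in Z} p_z\,\mathcal{N}(\mu_z, R_z) \right)$$

For each label $z$, we have the following inequality:

$$\Big[\prod_{z' \in Z} p(R_{z'})\Big]^{1/T} \mathcal{N}(\mu_z, R_z) \le \frac{C}{|R_z|^{b}} \exp\left[-a\,\mathrm{Tr}\left(R_z^{-1} J\right)\right]$$

where $C$, $a$ and $b$ are positive constants. Thus, to prove the proposition, we need to show that
$\forall\, a > 0$, $b > 0$ and $R_s$ a singular matrix, we have:

$$\lim_{R_z \to R_s} \frac{1}{|R_z|^{b}} \exp\left[-a\,\mathrm{Tr}\left(R_z^{-1} J\right)\right] = 0$$

Using the inequality

$$(\det A)^{1/n} \le \frac{1}{n}\,\mathrm{Tr}(A)$$

valid for any symmetric nonnegative definite $n \times n$ matrix $A$, applied to
$A = R_z^{-1/2} J R_z^{-1/2}$, we have:

$$\frac{1}{|R_z|^{b}} \exp\left[-a\,\mathrm{Tr}\left(R_z^{-1} J\right)\right] \le \frac{1}{|R_z|^{b}} \exp\left[-a\,n\,\frac{|J|^{1/n}}{|R_z|^{1/n}}\right]$$

In the above inequality, the right-hand side goes to zero when $R_z$ approaches the
boundary of singularity. Therefore, the penalized likelihood is bounded and is null on
the boundary of singularity.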
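Both properties of the prior used above, its mode and its decay at the boundary of singularity, can be illustrated numerically. This is a sketch under stated assumptions: the prior is taken in the unnormalized form $|R|^{-\beta}\exp[-\alpha\,\mathrm{Tr}(R^{-1}J)]$, and the values of `alpha`, `beta`, `J` and the helper name `log_prior` are chosen only for illustration.

```python
import numpy as np

def log_prior(R, alpha, beta, J):
    """Unnormalized log prior: -beta * log|R| - alpha * Tr(R^{-1} J) (hypothetical helper)."""
    _, logdet = np.linalg.slogdet(R)
    return -beta * logdet - alpha * np.trace(np.linalg.solve(R, J))

# Illustrative hyperparameters (assumptions).
alpha, beta = 2.0, 4.0
J = np.array([[2.0, 0.5], [0.5, 1.0]])
mode = (alpha / beta) * J
log_prior_at_mode = log_prior(mode, alpha, beta, J)

# Small symmetric perturbations of the mode should all score lower.
rng = np.random.default_rng(1)
perturbed = []
for _ in range(20):
    E = rng.normal(scale=0.05, size=(2, 2))
    perturbed.append(log_prior(mode + 0.5 * (E + E.T), alpha, beta, J))

# Approaching the boundary of singularity: one eigenvalue eps tends to zero.
boundary = [log_prior(np.diag([eps, 1.0]), alpha, beta, J)
            for eps in (1e-1, 1e-2, 1e-3, 1e-4)]
print(log_prior_at_mode, max(perturbed), boundary)
```

The log prior decreases without bound along the sequence of nearly singular matrices, which is the mechanism that cancels the divergence of the Gaussian normalization factor.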

At this point, we can also follow the arguments in [4] to prove the existence of positive
definite matrices corresponding to the modes of the penalized likelihood. Figure 2
illustrates the regularization effect of this penalization. Here we used the same samples
generated for Figure 1 and estimated the parameters of the mixture by optimizing the
penalized likelihood criterion. The probability of degeneracy is zero.




[Figure 2, two panels: "Original distribution" and "Penalized EM estimated distribution with 100 samples".]

Fig-2. Regularization effect of the penalized EM algorithm.



ESTIMATION OF STRUCTURED COVARIANCE MATRICES 

In this section, we generalize the work in [4] to estimate covariance matrices of
specified structure in the mixture case. The constraints are summarized in the closed
subset $\mathcal{R}$ introduced in the definition (1) of the parameter set $\Theta$.



Unconstrained case: 

The unconstrained case has been treated in many works. In [7], three methods were
proposed: averaging, maximum penalized likelihood and Bayesian sampling. We briefly
recall the EM algorithm and Bayesian sampling, both of which can be seen as data
augmentation algorithms:

• EM algorithm: It consists of two steps:

(i) E (Expectation)-step: Consider the observations $s_{1..T}$ as incomplete data
and $(s_{1..T}, z_{1..T})$ as complete data, and compute the functional
$Q(\theta \mid \theta^{(k)}) = E\left[\log p(s_{1..T}, z_{1..T} \mid \theta) + \log p(\theta) \mid s_{1..T}, \theta^{(k)}\right]$;

(ii) M (Maximization)-step: Update $\theta^{(k+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(k)})$.

• Bayesian sampling: It consists of two steps:

(i) Generate $z_{1..T}^{(k)} \sim p(z_{1..T} \mid s_{1..T}, \theta^{(k)})$;

(ii) Generate $\theta^{(k+1)} \sim p(\theta \mid s_{1..T}, z_{1..T}^{(k)})$.

In the unconstrained case, one obtains, in both of the above algorithms,
functionals which have only one maximum, obtained by setting the gradient to zero.
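The unconstrained penalized EM recalled above can be sketched as follows. This is an illustration, not the authors' code: it assumes the prior $|R_z|^{-\beta}\exp[-\alpha\,\mathrm{Tr}(R_z^{-1}J)]$ on each covariance and no prior on the weights and means, in which case setting the gradient of $Q$ to zero gives the covariance update $(N_z S_z + 2\alpha J)/(N_z + 2\beta)$. The function name `penalized_em`, the initialization and the test data are assumptions.

```python
import numpy as np

def penalized_em(samples, K, alpha, beta, J, n_iter=50, seed=0):
    """MAP-EM for a Gaussian mixture with prior |R|^-beta exp(-alpha Tr(R^-1 J)) on covariances."""
    T, n = samples.shape
    rng = np.random.default_rng(seed)
    weights = np.full(K, 1.0 / K)
    means = samples[rng.choice(T, size=K, replace=False)].copy()
    covs = np.stack([np.cov(samples.T) + 1e-6 * np.eye(n)] * K)
    for _ in range(n_iter):
        # E-step: posterior label probabilities P(z(t) = z | s(t), theta^(k)).
        log_resp = np.empty((T, K))
        for z in range(K):
            diff = samples - means[z]
            _, logdet = np.linalg.slogdet(covs[z])
            maha = np.einsum("ti,ij,tj->t", diff, np.linalg.inv(covs[z]), diff)
            log_w = np.log(np.maximum(weights[z], 1e-300))
            log_resp[:, z] = log_w - 0.5 * (n * np.log(2 * np.pi) + logdet + maha)
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: closed-form maximizer of Q in the unconstrained case.
        N = resp.sum(axis=0)
        weights = N / T
        for z in range(K):
            means[z] = resp[:, z] @ samples / max(N[z], 1e-12)
            diff = samples - means[z]
            scatter = (resp[:, z, None] * diff).T @ diff   # equals N_z * S_z
            covs[z] = (scatter + 2.0 * alpha * J) / (N[z] + 2.0 * beta)
    return weights, means, covs

# Illustrative data (assumption): two well-separated clusters, deliberately over-fitted with K = 4.
rng = np.random.default_rng(0)
samples = np.vstack([rng.normal(-2.0, 0.5, size=(30, 2)),
                     rng.normal(2.0, 0.5, size=(30, 2))])
weights, means, covs = penalized_em(samples, K=4, alpha=1.0, beta=1.0, J=np.eye(2))
min_eig = min(np.linalg.eigvalsh(R).min() for R in covs)
print(weights, min_eig)
```

Because the weighted scatter is nonnegative definite and $N_z \le T$, every updated covariance has eigenvalues no smaller than $2\alpha\,\lambda_{\min}(J)/(T + 2\beta)$, so the iterates cannot reach the boundary of singularity.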



Constrained case: 



In both the EM algorithm and the Bayesian sampling method presented above, the second
step, which consists in updating $\theta$, was unconstrained. We see in the following how we
can combine the data augmentation algorithms with the iterative gradient algorithm
proposed in [4] to constrain the covariance matrix $R_z$ to be in the closed set $\mathcal{R}$.



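Structured EM

The functional $Q(\theta \mid \theta^{(k)})$ can be decomposed as follows:

$$Q(\theta \mid \theta^{(k)}) = \sum_{z=1}^{K} g(R_z, S_z) + f(p, \mu \mid \theta^{(k)})$$

with:

$$g(R_z, S_z) = -\left(1 + \frac{2\beta}{N_z}\right) \log |R_z| - \mathrm{Tr}\left(R_z^{-1}\left(S_z + \frac{2\alpha J}{N_z}\right)\right)$$

$$N_z = \sum_{t=1}^{T} p(z(t) = z \mid s(t), \theta^{(k)})$$

(each $g$ is written up to the positive factor $N_z/2$, which does not change its maximizer)
and $S_z$ the weighted sample covariance matrix:

$$S_z = \frac{\sum_{t=1}^{T} \left(s(t) - \mu_z^{(k+1)}\right)\left(s(t) - \mu_z^{(k+1)}\right)^{*} p(z(t) = z \mid s(t), \theta^{(k)})}{N_z}$$

Thus, the maximization of $Q$ with respect to $R_z$ is equivalent to the maximization of
$g(R_z, S_z)$ with respect to $R_z$. The necessary gradient equations are:

$$\delta g(R_z, S_z) = \mathrm{Tr}\left(\left[R_z^{-1}\left(S_z + \frac{2\alpha J}{N_z}\right)R_z^{-1} - \left(1 + \frac{2\beta}{N_z}\right)R_z^{-1}\right]\delta R_z\right) = 0 \qquad (2)$$

In the unconstrained case, the solution of (2) is $R_z = \dfrac{N_z S_z + 2\alpha J}{N_z + 2\beta}$. Constrained maximization
of $g$ with $R_z \in \mathcal{R}$ for an arbitrary $\mathcal{R}$ is not easy. However, if $\mathcal{R}$ is such that
$R \in \mathcal{R} \Rightarrow \delta R \in \mathcal{R}$ (for example the set of Toeplitz matrices), then we replace the
second step of the EM algorithm by the following:

1. Find $D_z^{(k+1)}$ belonging to $\mathcal{R}$ so that $g(R_z^{(k)}, S_z - D_z^{(k+1)})$ satisfies the necessary
gradient conditions.

2. Put $R_z^{(k+1)} = R_z^{(k)} + D_z^{(k+1)}$.

This modification preserves the monotonicity of the EM algorithm and makes the problem
linear in $D_z$, so it is easier to impose constraints, provided that the
variation of $R_z$ still belongs to $\mathcal{R}$, which is true for a wide range of constraints such as
the Toeplitz case.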
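One step of this inverse iteration can be sketched for real symmetric Toeplitz matrices. The sketch rests on assumptions that are not confirmed by the text: the gradient condition is read as $\mathrm{Tr}\big([R^{-1}(S - D + P)R^{-1} - (1 + d)R^{-1}]\,Q_l\big) = 0$ for every basis element $Q_l$ of the constraint set, where `prior_term` stands for the prior contribution $P$ added to $S_z$ and `d` for the prior contribution added to 1 (both zero in the unpenalized case). The names `toeplitz_basis` and `structured_step` are hypothetical.

```python
import numpy as np

def toeplitz_basis(n):
    """Basis Q_0..Q_{n-1} of the real symmetric Toeplitz n x n matrices."""
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return [(lags == lag).astype(float) for lag in range(n)]

def structured_step(R, S, basis, prior_term=None, d=0.0):
    """Return D in span(basis) such that g(R, S - D) meets the gradient conditions.

    The condition is linear in the coordinates x of D = sum_l x_l Q_l:
    sum_m x_m Tr(R^-1 Q_m R^-1 Q_l) = Tr([R^-1 (S + P) R^-1 - (1 + d) R^-1] Q_l).
    """
    n = R.shape[0]
    P = np.zeros((n, n)) if prior_term is None else prior_term
    R_inv = np.linalg.inv(R)
    target = R_inv @ (S + P) @ R_inv - (1.0 + d) * R_inv
    whitened = [R_inv @ Q @ R_inv for Q in basis]
    gram = np.array([[np.trace(W @ Q) for W in whitened] for Q in basis])
    rhs = np.array([np.trace(target @ Q) for Q in basis])
    x = np.linalg.solve(gram, rhs)
    return sum(x_l * Q for x_l, Q in zip(x, basis))

# Illustrative use (assumption): an unstructured S, starting from R = I.
basis = toeplitz_basis(3)
S = np.array([[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]])
R_new = np.eye(3) + structured_step(np.eye(3), S, basis)
print(R_new)
```

Starting from the identity, a single step averages `S` along its diagonals, and a Toeplitz `S` is a fixed point of the unpenalized iteration. In practice the update `R + D` may need a step-size safeguard to stay positive definite; that safeguard is not shown here.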



Structured Bayesian sampling

We propose the following Bayesian sampling scheme:

1. Generate $z_{1..T}^{(k)} \sim p(z_{1..T} \mid s_{1..T}, \theta^{(k)})$;

2. Generate $D_z^{(k+1)}$ belonging to $\mathcal{R}$ according to the a posteriori distribution
$p(D_z \mid s_{1..T}, z_{1..T}^{(k)}) \propto \exp\left[g(R_z^{(k)}, S_z - D_z^{(k+1)})\right]$;

3. Update $R_z^{(k+1)} = R_z^{(k)} + D_z^{(k+1)}$.

$S_z$ is the sample covariance depending on the partition defined by $z_{1..T}^{(k)}$:

$$S_z = \frac{\sum_{t \in T_z} s(t)\,s(t)^{*}}{\mathrm{Card}(T_z)}, \qquad T_z = \{t \mid z(t) = z\}$$

To be sure that the sampling keeps $D_z$ in the closed set $\mathcal{R}$, we define a basis $(Q_l)_{l=1..L}$
of $\mathcal{R}$ and we sample the projection of $D_z$ on $\mathcal{R}$: $x_{1..L} \sim p(x_{1..L} \mid s_{1..T}, z_{1..T}^{(k)})$, where the
vector $x_{1..L}$ is defined by:

$$D_z = \sum_{l=1}^{L} x_l\, Q_l$$
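The two bookkeeping operations of this scheme, the partition-based sample covariance and the mapping from coordinates to a structured matrix, can be sketched as below. The posterior on the coordinates is not implemented: the draw `x` is a stand-in from a standard normal, used only to show that any coordinate vector yields a matrix inside the constraint set. The function names and the symmetric Toeplitz basis are assumptions.

```python
import numpy as np

def partition_covariances(samples, labels, K):
    """S_z = sum_{t in T_z} s(t) s(t)^T / Card(T_z), with T_z = {t | z(t) = z}."""
    covs = []
    for z in range(K):
        members = samples[labels == z]
        # An empty class has no sample covariance; signal it with None.
        covs.append(members.T @ members / len(members) if len(members) else None)
    return covs

def from_coordinates(x, basis):
    """D = sum_l x_l Q_l for coordinates x on the basis (Q_l)."""
    return np.tensordot(x, np.stack(basis), axes=1)

# Symmetric Toeplitz basis for n = 3 (illustrative choice of constraint set).
n = 3
lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
basis = [(lags == lag).astype(float) for lag in range(n)]

rng = np.random.default_rng(0)
x = rng.normal(size=n)          # stand-in draw, NOT the posterior of the text
D = from_coordinates(x, basis)
print(D)
```

Sampling the coordinates rather than the matrix entries guarantees the structure by construction, whatever distribution is used for `x`.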



MIXED SOURCES 

We now consider the case where the sources are not directly observed but are mixed by an
unknown mixing matrix $A$, and we want to take measurement errors into account, so that the
observations are modeled by the following equation:

$$x(t) = A\,s(t) + n(t)$$

In this section, we show that when we estimate jointly the mixing
matrix $A$, the noise covariance matrix $R_\epsilon$ and the parameters of the mixture by maximizing
the likelihood $p(x_{1..T} \mid A, R_\epsilon, \theta_z)$, we encounter the same degeneracy problems
mentioned above. The likelihood function has the following expression:

$$p(x_{1..T} \mid A, R_\epsilon, \theta_z) = \prod_{t=1}^{T} \sum_{z=1}^{K} p_z(t)\,\mathcal{N}(A\mu_z, A R_z A^{*} + R_\epsilon)$$

with $\theta_z = (\mu_z, R_z, p_z)$.

The expression $p_z(t) = \sum_{z_{1..T},\, z(t) = z} p(z_{1..T})$ represents the marginal law of $z(t)$. Indeed,
the hidden variables need not be white, nor the mixture i.i.d.
We can rewrite the expression of the likelihood in a more general form in which the
marginalization is not performed:

$$p(x_{1..T} \mid A, R_\epsilon, \theta_z) = \sum_{z_{1..T}} p(z_{1..T}) \prod_{t=1}^{T} \mathcal{N}(A\mu_{z(t)}, A R_{z(t)} A^{*} + R_\epsilon)$$

It is obvious, in this form, that degeneracy happens when one of the terms constituting
the sum tends to infinity, independently of the law $p(z_{1..T})$.

Consider now the matrices $\Gamma_z = A R_z A^{*} + R_\epsilon$. It is clear that degeneracy is produced
when, among the matrices $\Gamma_z$, at least one is singular and one is regular. We show in the
following that this situation can occur.
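The i.i.d. form of the mixed-source likelihood can be evaluated as in the sketch below, which makes the role of the matrices $\Gamma_z = A R_z A^{*} + R_\epsilon$ explicit. It assumes real-valued data (so $A^{*}$ is the transpose) and time-independent weights; the function name `mixed_loglik` and the example values are hypothetical.

```python
import numpy as np

def mixed_loglik(x, A, R_eps, weights, means, covs):
    """log p(x_{1..T} | A, R_eps, theta) for x(t) = A s(t) + n(t) with i.i.d. labels."""
    m = x.shape[1]
    log_terms = []
    for p, mu, R in zip(weights, means, covs):
        gamma = A @ R @ A.T + R_eps                 # covariance of x(t) given z
        diff = x - A @ mu
        sign, logdet = np.linalg.slogdet(gamma)
        if sign <= 0:
            raise ValueError("Gamma_z is singular: the Gaussian density is undefined")
        maha = np.einsum("ti,ti->t", diff, np.linalg.solve(gamma, diff.T).T)
        log_terms.append(np.log(p) - 0.5 * (m * np.log(2 * np.pi) + logdet + maha))
    log_terms = np.stack(log_terms)
    top = log_terms.max(axis=0)                     # log-sum-exp over components
    return float(np.sum(top + np.log(np.exp(log_terms - top).sum(axis=0))))

# Illustrative call (assumption): one observation at the origin, one component.
value = mixed_loglik(np.zeros((1, 2)), np.eye(2), np.eye(2),
                     [1.0], [np.zeros(2)], [np.eye(2)])
print(value)
```

The explicit check on `slogdet` marks the situation discussed in the text: the density is only defined for regular $\Gamma_z$, and the likelihood diverges along sequences that approach a singular one.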

We recall that the matrices $R_z$ and $R_\epsilon$ belong to a closed subset of the set of
nonnegative definite matrices. Constraining the matrices to be positive definite leads to
complicated solutions. The main origin of this complication is the fact that the set of
positive definite matrices is not closed. For the same reason, we do not constrain the
mixing matrix $A$ to be of full rank.

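Proposition 3: $\forall\, A$ non null, $\exists$ matrices $\{\Gamma_z = A R_z A^{*} + R_\epsilon$ for $z = 1..K\}$ such that

$$\{z \mid \Gamma_z \text{ is singular}\} \neq \emptyset \quad \text{and} \quad \{z \mid \Gamma_z \text{ is regular}\} \neq \emptyset.$$

$R_\epsilon$ is necessarily a singular NND matrix and $\mathrm{Card}(\{z \mid R_z \text{ is regular}\}) < K$.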



Proof: Without loss of generality, we show how to construct
a singular matrix $\Gamma_1$ while the other matrices $\Gamma_z$ are regular. We consider NND matrices.
Therefore, the kernel of the corresponding linear mapping coincides with its isotropic
cone. Thus, we have:

$$\mathrm{Ker}(\Gamma_z) = \mathrm{Ker}(A R_z A^{*}) \cap \mathrm{Ker}(R_\epsilon)$$

It is sufficient to prove the existence of $R_\epsilon$ and $(R_z)_{z=1..K}$ that satisfy the following
condition:

$$\begin{cases} \mathrm{Ker}(A R_1 A^{*}) \cap \mathrm{Ker}(R_\epsilon) \neq \{0\} \\ \mathrm{Ker}(A R_z A^{*}) \cap \mathrm{Ker}(R_\epsilon) = \{0\}, \quad z = 2..K \end{cases} \qquad (3)$$

If the matrix $R_\epsilon$ is regular, there is no degeneracy: according to the mini-max principle
applied to the characterization of the eigenvalues of the sum of two Hermitian matrices,
the eigenvalues of $\Gamma_z$ are greater than or equal to those of $R_\epsilon$ and hence strictly positive,
which implies that all of the matrices $\Gamma_z$ are regular.

We have:

$$\mathrm{Ker}(A^{*}) \subset \mathrm{Ker}(A R_z A^{*}), \quad z = 1..K \qquad (4)$$

Equality holds if $R_z$ is regular or if $\mathrm{Ker}(R_z) \cap \mathrm{Im}(A^{*}) = \{0\}$. Note that if all the
matrices $R_z$ are regular, there is no degeneracy.

Suppose then that the matrices $R_z$, except the first matrix $R_1$, are regular. We
now construct matrices $R_1$ and $R_\epsilon$ satisfying condition (3). Let a
non-null vector $x_s$ belong to $[\mathrm{Ker}(A^{*})]^{\perp}$. There exist NND matrices $R_1$ and $R_\epsilon$ such
that $x_s \in \mathrm{Ker}(A R_1 A^{*}) \cap \mathrm{Ker}(R_\epsilon)$. In fact, consider the family of vectors $(x_j)_{j \in J}$
belonging to $\mathrm{Ker}(A^{*})$ such that the family $\{x_s\} \cup (x_j)_{j \in J}$ is orthogonal (this is ensured
by the incomplete basis theorem). The matrices $R_1 = \sum_{j \in J} \alpha_j x_j x_j^{*}$ ($\alpha_j \geq 0$)
and $R_\epsilon = \sum_{j \in J} \beta_j x_j x_j^{*}$ ($\beta_j > 0$) are such that $x_s \in \mathrm{Ker}(A R_1 A^{*}) \cap \mathrm{Ker}(R_\epsilon)$ by
construction and $\mathrm{Ker}(A R_2 A^{*}) \cap \mathrm{Ker}(R_\epsilon) = \{0\}$. We have thus constructed matrices
which satisfy the degeneracy condition.

Note that the singularity of the matrices $R_1$ and $R_\epsilon$ is a necessary but not
sufficient condition: the matrix $R_1$ can be singular with $\mathrm{Ker}(A R_1 A^{*}) = \mathrm{Ker}(A^{*})$, so that
there is no degeneracy, or $R_\epsilon$ can be singular with $\mathrm{Ker}(A R_z A^{*}) \cap \mathrm{Ker}(R_\epsilon) \neq \{0\}$,
$\forall z \in \{1..K\}$, which implies that all matrices $\Gamma_z$ are singular, so that no degeneracy
occurs.
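A small numerical instance makes the degeneracy condition concrete. The matrices below are a hypothetical example chosen for simplicity (square identity mixing matrix, two components), not the general construction of the proof: the noise covariance and the first source covariance share a kernel direction, while the second source covariance is regular.

```python
import numpy as np

# Hypothetical 2 x 2 example with K = 2 components.
A = np.eye(2)                      # mixing matrix, Ker(A^T) = {0}
R_eps = np.diag([0.0, 1.0])        # singular NND noise covariance
R_1 = np.diag([0.0, 1.0])          # singular NND source covariance
R_2 = np.eye(2)                    # regular source covariance

gammas = [A @ R @ A.T + R_eps for R in (R_1, R_2)]
ranks = [int(np.linalg.matrix_rank(G)) for G in gammas]

# Direction shared by Ker(A R_1 A^T) and Ker(R_eps).
x_s = np.array([1.0, 0.0])
print(ranks, gammas[0] @ x_s)
```

Here the first matrix $\Gamma_1$ is singular along `x_s` while $\Gamma_2$ is regular, which is exactly the mixed situation in which the likelihood can diverge.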



DEGENERACY ELIMINATION IN THE MIXED CASE 

In light of what we presented in the first two sections, one possible way to
eliminate this degeneracy consists in penalizing the likelihood by an Inverse Wishart a
priori for the covariance matrices. In fact, we know that the origin of the degeneracy is that
the covariance matrices $R_z$ and $R_\epsilon$ approach the boundary of singularity (in a
non-arbitrary way). Thus, we penalize the likelihood so that, when one of the covariance
matrices approaches the boundary, the a posteriori distribution goes to zero, which eliminates
the infinite value at the boundary and even forces it to zero.

Proposition 5: $\forall\, x_{1..T} \in (\mathbb{R}^m)^T$, the likelihood $p(x_{1..T} \mid \theta_z, R_\epsilon, A)$ penalized by an
a priori Inverse Wishart for the noise covariance matrix $R_\epsilon$ or by an a priori Inverse
Wishart for the matrices $R_z$ is bounded and goes to 0 when one of the covariance
matrices approaches the boundary of singularity.



Proof 5: The proof is based upon the proof of the proposition 4, except that
here the a priori is not directly related to the matrices $\Gamma_z = A R_z A^{*} + R_\epsilon$, but
to the covariance matrices $R_z$ or $R_\epsilon$. Then, we have the following alternative:

• If one penalizes by an a priori Inverse Wishart on the matrix $R_\epsilon$, we have the
following inequality, where $C$, $a$ and $b$ are positive constants:

$$[p(R_\epsilon)]^{1/T}\,\mathcal{N}(A\mu_z, \Gamma_z) \le \frac{C}{|\Gamma_z|^{1/2}\,|R_\epsilon|^{b}} \exp\left[-a\,\mathrm{Tr}\left(R_\epsilon^{-1} J\right)\right]$$

Now, according to the mini-max principle applied to the characterization of
eigenvalues, we have:

$$|\Gamma_z| = |A R_z A^{*} + R_\epsilon| \ge |R_\epsilon|$$

which yields the following inequality:

$$[p(R_\epsilon)]^{1/T}\,\mathcal{N}(A\mu_z, \Gamma_z) \le \frac{C}{|R_\epsilon|^{b + 1/2}} \exp\left[-a\,\mathrm{Tr}\left(R_\epsilon^{-1} J\right)\right]$$

This ensures the convergence to 0 of the penalized likelihood when $R_\epsilon$ goes to a
singular matrix and ensures, as well, the elimination of the degeneracy, one of whose
necessary conditions is the singularity of the covariance $R_\epsilon$.

• If we penalize only by an Inverse Wishart prior on the matrices $R_z$, with a uniform
a priori on the matrix $R_\epsilon$, we have a similar inequality:

$$[p(R_z)]^{1/T}\,\mathcal{N}(A\mu_z, \Gamma_z) \le \frac{C}{|A R_z A^{*}|^{1/2}\,|R_z|^{b_z}} \exp\left[-a_z\,\mathrm{Tr}\left(R_z^{-1} J_z\right)\right]$$

Here, the only concern is that the determinant $|A|$ may go to zero faster than the
exponential of $|R_z|$ but, in this situation, the degeneracy condition (3) is not satisfied
because of the inclusion relation (4).
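The determinant inequality $|A R_z A^{*} + R_\epsilon| \ge |R_\epsilon|$ that drives the first case can be spot-checked on random matrices. This is only a numerical illustration in the real-valued case with arbitrarily chosen dimensions; the helper `random_nnd` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_nnd(n, rank):
    """Random nonnegative definite n x n matrix of the given rank (hypothetical helper)."""
    B = rng.normal(size=(n, rank))
    return B @ B.T

gaps = []
for _ in range(100):
    A = rng.normal(size=(3, 2))                 # 3 sensors, 2 sources (illustrative)
    R_z = random_nnd(2, 2)
    R_eps = random_nnd(3, 3)                    # positive definite almost surely
    gamma = A @ R_z @ A.T + R_eps
    # log|Gamma_z| - log|R_eps| should never be negative.
    gaps.append(np.linalg.slogdet(gamma)[1] - np.linalg.slogdet(R_eps)[1])
print(min(gaps))
```

Since adding the nonnegative definite term $A R_z A^{*}$ can only increase the eigenvalues, the factor $|\Gamma_z|^{-1/2}$ is dominated by $|R_\epsilon|^{-1/2}$, which is then controlled by the prior on $R_\epsilon$.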

CONCLUSION 

The set of parameter singularities which characterizes the likelihood degeneracy of a
multivariate Gaussian mixture is identified. A Bayesian solution to this degeneracy is
proposed. We proposed a modified version of the data augmentation algorithms which
makes it possible to account for some constraints on the structure of the covariance matrices
of the Gaussian mixture distribution. It consists essentially in the introduction of an
inverse iteration to make the problem linear with respect to the matrix estimate. The
case of source separation with Gaussian mixture model sources is also considered and
discussed.